Genetic Correlations in Mutation Processes 
We study the role of phylogenetic trees on correlations in mutation
processes. Generally, correlations decay exponentially with the
generation number. We find that two distinct regimes of behavior
exist. For mutation rates smaller than a critical rate, the underlying
tree morphology is almost irrelevant, while mutation rates higher than
this critical rate lead to strong tree-dependent correlations. We show
analytically that identical critical behavior underlies all multiple
point correlations. This behavior generally characterizes branching
processes undergoing mutation.

PACS numbers: 87.10+e, 87.15.Cc, 02.50.-r, 87.23.Kg 
Biological evolution is influenced by a number of processes
including population growth, mutation, extinction,
and interaction with the environment, to name a few [1].
Genetic sequences are strongly affected by such processes 
and thus provide an important clue to their nature. The 
ongoing effort of reconstructing evolution histories given 
the incomplete set of mapped sequences constitutes much 
of our current understanding of biological evolution. 

However, this challenge is extraordinary as it involves 
an inverse problem with an enormous number of degrees 
of freedom. Statistical methods such as maximum like- 
lihood techniques coupled with simplifying assumptions 
on the nature of the evolution process are typically used 
to infer the structure of the underlying evolutionary tree, 
i.e., the phylogeny.

Genetic sequences such as RNA/DNA or amino acid 
sequences can be seen as words with letters taken from 
an alphabet of 4 or 20 symbols, respectively. Generally, 
there are nontrivial intra-sequence correlations that in- 
fluence the evolution of the entire sequence. Addition- 
ally, the structure of the evolutionary tree plays a role 
in this process as one generally expects that the closer 
sequences are on this tree, the more correlated they are.
In this study, we are interested in describing the influence
of the latter aspect, namely the phylogeny, on the
evolution of sequences. Specifically, we examine correla- 
tions between sequences, thereby complementing related 
studies on changes in fluctuations and entropy due to 
the phylogeny [7], [8]. To this end, we consider particularly
simple sequences and focus on a model that mimics
the competition between the fundamental processes of 
mutation and duplication. 

The rest of this paper is organized as follows. In Sec. II, 
the model is introduced, and the main result is demon- 
strated using the pair correlations. Correlations of ar- 
bitrary order are obtained and analyzed asymptotically 
in Sec. III. To examine the range of validity of the re- 
sults, generalizations to stochastic tree morphologies and 
sequences with larger alphabets are briefly discussed in 
Sections IV and V. Sec. VI discusses implications for mul- 
tiple site correlations in sequences with independently 
evolving sites. We conclude with a summary and a dis- 
cussion in Sec. VII. 



Let us formulate the model first. The sequences are taken to be of
unit length and the corresponding alphabet consists of two letters.
The numeric values $\sigma = \pm 1$ are conveniently assigned to these
letters. We will focus on binary trees where the number of children
equals two. This structure is deterministic in that both the number
of children and the generation lifetime are fixed. Nevertheless, the
results apply qualitatively to stochastic tree morphologies as well.
Finally, the mutation process is implemented as follows: with
probability $1 - p$ a child equals its predecessor, while with
probability $p$ a mutation occurs, as illustrated in Fig. 1. The
mutation process is invariant under the transformation
$\sigma \to -\sigma$ and $p \to 1 - p$, and we restrict our attention
to the case $0 < p < 1/2$ without loss of generality.
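The model can be sketched in a few lines of Python. This is an illustrative simulation under the rules just stated (two children per node, a sign flip with probability p on every branch); the function name `simulate_leaves` and its arguments are hypothetical and not part of the original work.

```python
import random


def simulate_leaves(k, p, root=1, rng=None):
    """Return the 2**k node values of generation k for one realization.

    Each child copies its parent with probability 1 - p and carries the
    opposite sign with probability p (a mutation).
    """
    rng = rng or random.Random()
    generation = [root]
    for _ in range(k):
        next_generation = []
        for parent in generation:
            for _child in range(2):  # binary tree: two children per node
                tau = -1 if rng.random() < p else 1  # branch variable
                next_generation.append(parent * tau)
        generation = next_generation
    return generation
```

Calling `simulate_leaves(3, 0.1, rng=random.Random(0))` returns one realization of the eight third-generation nodes; averaging products of such values over many realizations estimates the correlations discussed next.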




Fig. 1 The mutation process on a two-generation binary tree.
The multiplicative variable $\tau$ indicates whether a mutation
occurred.

A natural question is how correlated are the various leaves (or nodes)
of a tree in a given generation (or equivalently, time)? Consider
$G_2(k)$, the average correlation between two nodes at the $k$th
generation

$$ G_2(k) = \langle\langle \sigma_i \sigma_j \rangle\rangle. \tag{1} $$

The first average should be taken over all realizations for a fixed
pair of nodes $i \neq j$, while the second average is taken over all
different pairs belonging to the same generation. For example,
consider this quantity at the second generation (see Fig. 1),
$G_2(2) = [\langle\sigma_3\sigma_4\rangle + \langle\sigma_3\sigma_5\rangle + \langle\sigma_3\sigma_6\rangle]/3$.
One index ($i = 3$) may be fixed since all nodes in a given generation
are equivalent.

To evaluate averages, it is useful to assign a multiplicative random
variable $\tau_i = \pm 1$ to every branch of the tree such that
$\sigma_i = \sigma_j \tau_i$, with $j$ the predecessor of $i$. One has
$\tau_i = 1$ ($-1$) with probability $1 - p$ ($p$), and consequently,

$$ \langle \tau \rangle = 1 - 2p. \tag{2} $$



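Pair correlations are readily calculated using the $\tau$ variables:
writing $\sigma_3 = \sigma_0\tau_1\tau_3$ and similarly for $\sigma_4$
gives $\langle\sigma_3\sigma_4\rangle = \langle\sigma_0\tau_1\tau_3\,\sigma_0\tau_1\tau_4\rangle = \langle\sigma_0^2\tau_1^2\tau_3\tau_4\rangle$.
Since $\sigma^2 = \tau^2 = 1$, this correlation simplifies,
$\langle\sigma_3\sigma_4\rangle = \langle\tau_3\tau_4\rangle$.
Furthermore, mutation processes on different branches are independent
and consequently $\langle\tau_i\tau_j\rangle = \langle\tau_i\rangle\langle\tau_j\rangle$
when $i \neq j$. Thus, $\langle\sigma_3\sigma_4\rangle = \langle\tau\rangle^2$
and similarly $\langle\sigma_3\sigma_5\rangle = \langle\sigma_3\sigma_6\rangle = \langle\tau\rangle^4$.
The overall picture becomes clear: when calculating two-point
correlations, the path to the tree root is traced for each node. As
$\tau^2 = 1$, doubly counted branches cancel. Only branches that trace
the path to the first common ancestor are relevant. In other words,

$$ \langle\sigma_i\sigma_j\rangle = \langle\tau\rangle^{d_{ij}}, \tag{3} $$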



with $d_{ij}$ the "genetic distance" between two points, the minimal
number of branches that connect two nodes. Indeed, at the second
generation $d_{3,4} = 2$, $d_{3,5} = d_{3,6} = 4$ and consequently
$G_2(2) = (a^2 + 2a^4)/3$ with the shorthand notation
$a = \langle\tau\rangle = 1 - 2p$. This generalizes into a geometric
series $G_2(k) = (a^2 + 2a^4 + \cdots + 2^{k-1}a^{2k})/(2^k - 1)$.
Evaluating this sum gives the pair correlation

$$ G_2(k) = \frac{a^2}{2a^2 - 1}\,\frac{(2a^2)^k - 1}{2^k - 1}. \tag{4} $$



Interestingly, pair correlations are not affected by the ini- 
tial state, i.e., the value of the tree root. 
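The closed form of Eq. (4) can be checked against a direct enumeration of all pairs. The sketch below assumes the nodes of generation k are indexed 0, ..., 2^k - 1 from left to right, so that the number of generations back to the first common ancestor is the bit length of the XOR of the two indices; this indexing and the function names are illustrative choices, not the authors' notation.

```python
def pair_distance(i, j):
    """Genetic distance between two distinct nodes i, j of one generation."""
    return 2 * (i ^ j).bit_length()


def g2_enumerated(k, p):
    """Average of <tau>**d_ij over all pairs of generation-k nodes."""
    a = 1 - 2 * p
    n = 2 ** k
    total = sum(a ** pair_distance(i, j)
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) // 2)


def g2_closed(k, p):
    """Closed form of Eq. (4); the critical case 2a^2 = 1 is summed separately."""
    a2 = (1 - 2 * p) ** 2
    if abs(2 * a2 - 1) < 1e-12:
        # every term of the geometric series equals 1/2 at the critical point
        return a2 * k / (2 ** k - 1)
    return a2 / (2 * a2 - 1) * ((2 * a2) ** k - 1) / (2 ** k - 1)
```

For small k the two functions agree to rounding error, which is a quick way to confirm both the distance rule and the summation.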

For sufficiently large generation numbers, the leading order of the
pair correlation decays exponentially with the generation number.
However, different constants characterize this decay, depending on
the mutation probability

$$ G_2(k) \simeq \begin{cases} \dfrac{a^2}{2a^2 - 1}\,a^{2k} & p < p_c; \\[1ex] \dfrac{a^2}{1 - 2a^2}\,2^{-k} & p > p_c. \end{cases} \tag{5} $$

As seen from Eq. (4), the transition between the two different
behaviors occurs when $2a^2 = 1$ or alternatively at the following
mutation probability

$$ p_c = \frac{1}{2}\left(1 - \frac{1}{\sqrt{2}}\right). \tag{6} $$



Although in general correlations decay exponentially,
$G_2(k) \sim \beta^{2k}$, the decay constant $\beta$ exhibits two
distinct behaviors which depend on the mutation probability $p$.
When the mutation probability is smaller than the critical one,
$p < p_c$, then $\beta = a$, while in the complementary case
$\beta = 1/\sqrt{2}$.
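The two regimes can be summarized compactly in code. The following is a minimal sketch assuming the range 0 <= p <= 1/2 used throughout; the names `P_C` and `decay_constant` are hypothetical.

```python
import math

# Critical mutation probability of Eq. (6), roughly 0.1464
P_C = 0.5 * (1 - 1 / math.sqrt(2))


def decay_constant(p):
    """Decay constant beta in G_2(k) ~ beta**(2k) for 0 <= p <= 1/2.

    Below p_c the constant equals a = 1 - 2p; above p_c it saturates at
    the tree-determined value 1/sqrt(2). Both cases reduce to a maximum.
    """
    a = 1 - 2 * p
    return max(a, 1 / math.sqrt(2))
```

Writing the result as a maximum makes the continuity at the critical point explicit: the two branches coincide exactly when a = 1/sqrt(2).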

As a reference, it is useful to consider the decay of the average
node value $G_1(k) = \langle\sigma\rangle$. At the $k$th generation,
the path to each node involves $k$ branches and thus,
$G_1(k) = G_1(0)a^k$ with $G_1(0) = \langle\sigma_0\rangle$. Writing
$G_1(k) \sim \beta^k$ then $\beta = a$ for all mutation probabilities,
in contrast with the asymptotic behavior of $G_2(k)$. Below the
critical mutation rate, $G_2(k) \propto [G_1(k)/G_1(0)]^2$,
indicating that knowledge of the one-point average suffices to
characterize correlations.




Fig. 2 The trivial "star" Phylogeny. The path connecting 
two nodes always contains the tree root. 



In fact, the above behavior can be attributed to the tree morphology.
To see that, it is useful to consider a structureless morphology where
the only ancestor shared by two nodes is the tree root itself (see
Fig. 2). Using the notation $G^*$ to denote correlations on this
"star" morphology, we see that the average remains unchanged,
$G_1^*(k) = G_1(k) = G_1(0)a^k$. The star morphology is trivial in
that all genetic distances are equal: $d_{ij} = 2k$ when $i \neq j$.
Thus, pair correlations are immediately obtained from the average,
$G_2^*(k) = [G_1^*(k)/G_1^*(0)]^2 = a^{2k}$. As branches in the star
morphology do not interact, no correlations develop.

In contrast, nontrivial phylogenies do induce correlations. Indeed,
$G_2(k) > G_2^*(k)$ when $p > 0$. Interestingly, when $p < p_c$,
merely the asymptotic prefactor $a^2/(2a^2 - 1) > 1$ in Eq. (5) is
enhanced and $G_2(k) \propto G_2^*(k)$. As the critical point is
approached, this constant diverges, thereby signaling the transition
into a second regime. When $p > p_c$, the decay constant itself is
enhanced and the ratio $G_2(k)/G_2^*(k)$ grows exponentially. The
mutation probability affects only the asymptotic prefactor, and the
decay constant $\beta = 1/\sqrt{2}$ is determined by the tree
morphology. We conclude that the nontrivial phylogeny generates
significant correlations for larger than critical mutation
probabilities.

This behavior can be understood and partially rederived using a
heuristic argument. Genetically close nodes are highly correlated,
while distant pairs are weakly correlated, as indicated by Eq. (3).
On the other hand, distant pairs are more numerous. Both effects are
magnified exponentially for large generation numbers, and their
competition results in a critical point. Different mechanisms
dominate on different sides of this point. Specifically, the number
of minimal genetic distance pairs ($d = 2$) is $2^{k-1}$, while the
number of maximal distance pairs ($d = 2k$) is $2^{2(k-1)}$. The rule
(3) gives the relative contributions of these two terms to the overall
two-point correlation: $2^{k-1}a^2$ versus $2^{2(k-1)}a^{2k}$. These
are simply the first and last terms in the geometric series that led
to Eq. (4). Comparing these two terms in the limit $k \to \infty$
correctly reproduces the most relevant aspects, i.e., the location of
the critical point (6) and the decay constants of Eq. (5). We
conclude that competition between the multiplicity and the degree of
correlation of close and distant nodes underlies the transition.
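The comparison of the two extreme terms can be written out explicitly. The short calculation below is a sketch based only on the pair counts quoted above; the normalization by the total number of pairs, $2^{k-1}(2^k - 1) \simeq 2^{2k-1}$, is added here for illustration.

```latex
\frac{2^{2(k-1)}\,a^{2k}}{2^{k-1}\,a^{2}} = \left(2a^{2}\right)^{k-1},
\qquad
\frac{2^{2(k-1)}\,a^{2k}}{2^{2k-1}} = \tfrac{1}{2}\,a^{2k},
\qquad
\frac{2^{k-1}\,a^{2}}{2^{2k-1}} = a^{2}\,2^{-k}.
```

For $2a^2 > 1$ the first ratio grows with $k$, so distant pairs dominate and the correlation decays as $a^{2k}$. For $2a^2 < 1$ close pairs dominate and the decay follows $2^{-k}$.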



III. HIGHER ORDER CORRELATIONS 

The above analysis gives useful intuition for the over- 
all qualitative behavior. Yet, it can be generalized into a 
more complete treatment that addresses correlations of 
arbitrary order. This set of quantities is helpful in de- 
termining the extent to which this picture applies, and 
in particular, whether the transition is actually a phase 
transition. 

Multiple point correlations obey a rule similar to Eq. (3). For
example, consider the four-node average
$\langle\sigma_3\sigma_4\sigma_5\sigma_6\rangle$ in Fig. 1. Using the
$\tau$ variables, we rewrite
$\langle\sigma_3\sigma_4\sigma_5\sigma_6\rangle = \langle\sigma_0^4\tau_1^2\tau_2^2\tau_3\tau_4\tau_5\tau_6\rangle$,
and since $\sigma^2 = \tau^2 = 1$ we get
$\langle\sigma_3\sigma_4\sigma_5\sigma_6\rangle = \langle\tau_3\tau_4\tau_5\tau_6\rangle = \langle\tau\rangle^4$
or $\langle\sigma_3\sigma_4\sigma_5\sigma_6\rangle = \langle\sigma_3\sigma_4\rangle\langle\sigma_5\sigma_6\rangle$.
The four point average equals a product of two-point averages with
the indices chosen so as to minimize the total number of branches.
This can also be seen by tracing the path of each node to the tree
root and canceling doubly counted branches. Thus, Eq. (3) generalizes
as follows:

$$ \langle\sigma_i\sigma_j\sigma_k\sigma_l\rangle = \langle\tau\rangle^{d_{i,j,k,l}}, \tag{7} $$

with the four-point genetic distance

$$ d_{i,j,k,l} = \min\{d_{i,j} + d_{k,l},\; d_{i,k} + d_{j,l},\; d_{i,l} + d_{j,k}\}. \tag{8} $$
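The distance rules can be verified by brute force on the two-generation tree of Fig. 1. The sketch below sums over all 2^6 realizations of the branch variables with their exact weights; it assumes heap-style labels (root 0, first generation 1 and 2, second generation 3 to 6, with 3 and 4 sharing a parent) and a root value of +1. The function name `exact_average` is hypothetical.

```python
from itertools import product


def exact_average(node_subset, p, k=2):
    """Exact average of the product of the chosen node values (root = +1)."""
    n_branches = 2 ** (k + 1) - 2
    total = 0.0
    for taus in product((1, -1), repeat=n_branches):
        # probability of this realization of the branch variables
        weight = 1.0
        for t in taus:
            weight *= (1 - p) if t == 1 else p
        # node m > 0 has parent (m - 1) // 2 and branch variable taus[m - 1]
        sigma = [1]
        for node in range(1, n_branches + 1):
            sigma.append(sigma[(node - 1) // 2] * taus[node - 1])
        value = 1
        for node in node_subset:
            value *= sigma[node]
        total += weight * value
    return total
```

With a = 1 - 2p, `exact_average((3, 4), p)` returns a^2, `exact_average((3, 5), p)` returns a^4, and `exact_average((3, 4, 5, 6), p)` returns a^4, in line with Eqs. (3), (7), and (8). The enumeration grows as 2 to the number of branches, so it is only practical for very small trees.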

Similarly, the law for arbitrary order averages is $\langle\tau\rangle$
raised to a power equal to the $n$-point genetic distance. Such
distance is obtained by considering all possible decompositions into
pairs of nodes. The genetic distance is the minimal sum of the
corresponding pair distances. Averages over an odd number of nodes
can be obtained by adding a "pseudo" node at the root of the tree and
using the convention $d_{i,\mathrm{root}} = k$ when $i$ belongs to the
$k$th generation. The average $\langle\sigma_0\rangle$ is generated by
the root and this factor multiplies all odd order correlations. Since
even order correlations are independent of the root value, and odd
correlations are simply proportional to $\langle\sigma_0\rangle$, we
set $\langle\sigma_0\rangle = 1$ in what follows without loss of
generality. The average $n$-point correlation is defined as follows

$$ G_n(k) = \langle\langle \sigma_{i_1}\sigma_{i_2}\cdots\sigma_{i_n} \rangle\rangle, \tag{9} $$

where the averages are taken over all realizations and over all
possible choices of $n$ distinct nodes at the $k$th generation. For
the trivial star phylogeny, the $n$-point genetic distance is constant
and equals a product of the correlation order and the generation
number, $d = nk$. Consequently, all averages are trivial as knowledge
of the one-point average immediately gives all higher-order averages,
$G_n^*(k) = [G_1(k)]^n$, or explicitly

$$ G_n^*(k) = a^{nk}. \tag{10} $$

When the tree morphology is nontrivial, the minimal-sum rules (7)-(8)
imply that such factorization no longer holds. For binary trees, it
is possible to obtain these correlations recursively. Let us assign
the indices $1, 2, \ldots, 2^k$ to the $k$th generation nodes and
order them as follows $1 \le i_1 < i_2 < \cdots < i_n \le 2^k$. As the
average over the realizations is performed first, the average
correlation requires a summation over all possible choices of nodes

$$ F_n(k) = \sum_{1 \le i_1 < i_2 < \cdots < i_n \le 2^k} \langle \sigma_{i_1}\sigma_{i_2}\cdots\sigma_{i_n} \rangle. \tag{11} $$

Proper normalization gives the $n$-node correlation

$$ G_n(k) = F_n(k) \Big/ \binom{2^k}{n}. \tag{12} $$

Consider a group of $n$ nodes taken from the $k$th generation. They
all share the tree root as a common ancestor. The two first
generation nodes naturally divide this group into two independently
evolving subgroups. This partitioning procedure allows a recursive
calculation of the correlations. Formally, a given choice of nodes
$1 \le i_1 < i_2 < \cdots < i_n \le 2^k$ is partitioned into two
subgroups as follows: $1 \le i_1 < \cdots < i_m \le 2^{k-1}$ and
$2^{k-1} + 1 \le i_{m+1} < \cdots < i_n \le 2^{k-1} + 2^{k-1}$. These
subgroups involve different $\tau$ variables, so their correlations
factorize

$$ \langle \sigma_{i_1}\cdots\sigma_{i_n} \rangle \propto \langle \sigma_{i_1}\cdots\sigma_{i_m} \rangle \langle \sigma_{i_{m+1}}\cdots\sigma_{i_n} \rangle. \tag{13} $$

The proportionality constant depends upon the parity of $m$ and
$n - m$. Even correlations are independent of the tree root, while odd
correlations are proportional to the average value of the tree root.
This extends to sub-trees as well, and since $\sigma_0 = 1$, the
average value of the root of both sub-trees is $\langle\tau\rangle$.
This factor accompanies all odd correlations. Substituting Eq. (13)
into Eq. (11) shows that the summation factorizes as well. Using
$F_m(k-1) = \sum_{1 \le i_1 < \cdots < i_m \le 2^{k-1}} \langle \sigma_{i_1}\cdots\sigma_{i_m} \rangle$
reduces the problem to two sub-trees that are one generation shorter,
and a recursion relation for $F_n(k)$ emerges

$$ F_n(k) = \sum_{m=0}^{n} F_m(k-1)\,B_m\,F_{n-m}(k-1)\,B_{n-m} \tag{14} $$

with the boundary conditions $F_n(0) = \delta_{n,0} + \delta_{n,1}$.
The summation corresponds to the $n + 1$ possible partitions of a
group of $n$ nodes into two subgroups. The weight of the odd
correlations is accounted for by $B_n$

$$ B_n = \begin{cases} 1 & n = 2r; \\ \langle\tau\rangle & n = 2r + 1. \end{cases} \tag{15} $$



Using the definition (11), the sums $F_n(k)$ vanish whenever
$n > 2^k$. This behavior emerges from the recursion relations as
well. Additionally, one can check that the sums are properly
normalized in the no mutation case ($a = 1$),
$F_n(k) = \binom{2^k}{n}$ when $n \le 2^k$.
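The recursion (14) with the weights (15) is straightforward to iterate numerically. The sketch below is one possible implementation under the stated boundary conditions; the function names `sums_F` and `correlation_G` are hypothetical, and `correlation_G` assumes n does not exceed 2^k so that the binomial normalization of Eq. (12) is nonzero.

```python
from math import comb


def sums_F(k, a, n_max):
    """Return [F_0(k), ..., F_n_max(k)] by iterating the recursion (14)."""
    # boundary conditions F_n(0) = delta_{n,0} + delta_{n,1}
    F = [1.0 if n <= 1 else 0.0 for n in range(n_max + 1)]
    for _ in range(k):
        # attach the weight B_n: 1 for even n, a = <tau> for odd n
        weighted = [F[n] * (a if n % 2 else 1.0) for n in range(n_max + 1)]
        F = [sum(weighted[m] * weighted[n - m] for m in range(n + 1))
             for n in range(n_max + 1)]
    return F


def correlation_G(n, k, a):
    """Average n-point correlation G_n(k) = F_n(k) / C(2^k, n)."""
    return sums_F(k, a, n)[n] / comb(2 ** k, n)
```

For a = 1 the sums reduce to binomial coefficients, and for n = 2 the output reproduces Eq. (4), which gives two independent checks of the recursion.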

For sufficiently small $n$, it is possible to evaluate the sums
explicitly using Eq. (14). The average correlations are then found
using Eq. (12)

$$ \begin{aligned}
G_0(k) &= 1, \\
G_1(k) &= a^k, \\
G_2(k) &= \frac{a^2}{2a^2 - 1}\,\frac{(2a^2)^k - 1}{2^k - 1}, \\
G_3(k) &= \frac{3a^{k+2}}{2a^2 - 1}\,\frac{\dfrac{(4a^2)^k - 4a^2}{4a^2 - 1} - (2^k - 2)}{(2^k - 1)(2^k - 2)}.
\end{aligned} \tag{16} $$

Indeed, these quantities agree with the previous results for
$n = 1, 2$ and equal unity when $p = 0$. We see that correlations
involve a sum of exponentials. Furthermore, it appears that the
condition $2a^2 = 1$ still separates two different regimes of
behavior. However, calculating higher correlations explicitly is not
feasible as the expressions are involved for large $n$. Instead, we
perform an asymptotic analysis that more clearly exposes the leading
large generation number behavior.

Let us consider first the regime $p < p_c$ or equivalently
$2a^2 > 1$. From Eq. (16), we see that the leading large $k$ behavior
of the average correlation satisfies $G_n(k) \sim a^{nk}$ for
$n = 0, 1, 2,$ and $3$. We will show below that this behavior extends
to higher order correlations, i.e.,

$$ G_n(k) \simeq g_n a^{nk}. \tag{17} $$

In other words, the following limit
$a = \lim_{k\to\infty}[G_n(k)]^{1/nk}$ exists and is independent of
$n$. As correlations are larger when the phylogeny is nontrivial, one
expects that $G_n(k) \geq G_n^*(k)$ or in terms of the prefactors,
$g_n \geq g_n^* = 1$. Combining Eq. (12) with the leading behavior of
the combinatorial normalization constant
$\binom{2^k}{n} \simeq 2^{nk}/n!$ gives the asymptotic behavior of
the sums

$$ F_n(k) \simeq f_n (2a)^{nk}, \quad \text{with} \quad f_n = \frac{g_n}{n!}. \tag{18} $$



Substituting Eq. (18) into the recursion relation Eq. (14) eliminates
the dependence on the generation number $k$, and a recursion relation
for the coefficients $f_n$ is found

$$ f_n (2a)^n = \sum_{m=0}^{n} f_m B_m f_{n-m} B_{n-m}, \tag{19} $$

with $B_n$ of Eq. (15). These recursion relations are consistent with
the conditions $f_0 = f_1 = 1$. The case $n = 2$ reproduces the
coefficient $f_2 = a^2/[(2a)^2 - 2]$. The divergence at $2a^2 = 1$
indicates that the ansatz (17) breaks down at the critical point. To
show that the ansatz holds in the entire range $0 < p < p_c$, one has
to show that the coefficients $f_n$ are positive and finite for all
$n$. Rewriting the recursion (19) explicitly,

$$ f_n\left[(2a)^n - 2B_n\right] = \sum_{m=1}^{n-1} f_m B_m f_{n-m} B_{n-m}, $$

allows us to prove this. Since $f_0 = f_1 = 1 > 0$, to complete a
proof by induction one needs to show that positive
$f_1, \ldots, f_{n-1}$ imply a positive $f_n$. The right hand side of
the recursion is then clearly positive and thus the positivity of
$f_n$ hinges on the positivity of the term $(2a)^n - 2B_n$. When
$2a^2 > 1$, then $a > 1/\sqrt{2}$ and certainly $2a > 1$. Combining
this with the inequality $(2a)^2 > 2 \geq 2B_n$ shows that
$(2a)^n - 2B_n > 0$ when $n \geq 2$. Hence $f_n$ is positive and
finite for all $n$, which validates the ansatz (17) in the regime
$p < p_c$.
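The positivity argument can be illustrated numerically by solving the explicit form of the recursion (19) order by order. The following is a small sketch assuming 2a^2 > 1, so that every denominator is positive; the function name `coefficients_f` is hypothetical.

```python
def coefficients_f(a, n_max):
    """Return [f_0, ..., f_n_max] from f_n [(2a)^n - 2 B_n] = sum over inner terms."""

    def weight(n):
        # B_n of Eq. (15): 1 for even n, a = <tau> for odd n
        return a if n % 2 else 1.0

    f = [1.0, 1.0]  # f_0 = f_1 = 1
    for n in range(2, n_max + 1):
        inner = sum(f[m] * weight(m) * f[n - m] * weight(n - m)
                    for m in range(1, n))
        f.append(inner / ((2 * a) ** n - 2 * weight(n)))
    return f[:n_max + 1]
```

For example, `coefficients_f(0.9, 12)` yields only positive entries, and `coefficients_f(1.0, 10)` reproduces 1/n!, the no-mutation limit.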

In principle, the coefficients can be found by introducing the
generating function

$$ f(z) = \sum_n f_n z^n. \tag{20} $$

Multiplying Eq. (19) by $z^n$ and summing over $n$ yields the
following equation for the generating function

$$ f(2az) = \left[\frac{f(z) + f(-z)}{2} + a\,\frac{f(z) - f(-z)}{2}\right]^2. \tag{21} $$



This equation reflects the structure of the recursion relations. A
factor $a$ is generated by each odd-index coefficient and as a
result, the odd part of the generating function
$[f(z) - f(-z)]/2 = f_1 z + f_3 z^3 + \cdots$ is multiplied by $a$.
Although a general solution of this equation appears rather
difficult, it is still possible to obtain results in the limiting
cases. It is useful to check that when $a = 1$, the above equation
reads $f(2z) = f^2(z)$, which together with the boundary conditions
$f_0 = f_1 = 1$ gives $f(z) = \exp(z)$ or $f_n = 1/n!$. As
$g_n \to 1$, the trivial correlations are recovered,
$G_n \to G_n^*$, indicating that the role played by the tree
morphology diminishes in the no mutation limit.

In the limit $p \to p_c^-$ it is possible to extract the leading
behavior of the asymptotic prefactors. Here, it is sufficient to keep
only the highest powers of the diverging term $1/(2a^2 - 1)$. The
calculation in this case is identical to the one detailed below for
the case $p > p_c$ and we simply quote the results

$$ G_n(k) \simeq \begin{cases}
\dfrac{(2r)!}{r!}\left[\dfrac{a^2}{2(2a^2 - 1)}\right]^r a^{2rk} & n = 2r; \\[2ex]
\dfrac{(2r+1)!}{r!}\left[\dfrac{a^2}{2(2a^2 - 1)}\right]^r a^{(2r+1)k} & n = 2r + 1.
\end{cases} \tag{22} $$

In this limit, the odd order correlations simply follow from their
even counterparts and for example $f_{2r+1} = f_{2r}$.

In the complementary case $p > p_c$, it proves useful to rewrite the
recursion relations (14) for the even and odd correlations separately

$$ \begin{aligned}
F_{2r}(k) &= \sum_{s=0}^{r} F_{2s}(k-1)\,F_{2r-2s}(k-1) + a^2 \sum_{s=0}^{r-1} F_{2s+1}(k-1)\,F_{2r-2s-1}(k-1), \\
F_{2r+1}(k) &= 2a \sum_{s=0}^{r} F_{2s}(k-1)\,F_{2r-2s+1}(k-1).
\end{aligned} \tag{23} $$

The leading asymptotic behavior of Eq. (16) implies $F_0(k) = f_0$,
$F_1(k) \simeq f_0 (2a)^k$, $F_2(k) \simeq f_2 2^k$, and
$F_3(k) \simeq f_2 2^k (2a)^k$ with $f_0 = 1$ and
$f_2 = a^2/[2 - (2a)^2]$. Let us assume that this even-odd pattern is
general

$$ \begin{aligned}
F_{2r}(k) &\simeq f_{2r}\,2^{rk}, \\
F_{2r+1}(k) &\simeq f_{2r}\,2^{rk}\,(2a)^k.
\end{aligned} \tag{24} $$



Substituting this ansatz into Eq. (23) shows that the second
summation in the recursion for the even correlations is negligible
asymptotically. Both equations reduce to

$$ f_{2r}\,2^r = \sum_{s=0}^{r} f_{2s}\,f_{2r-2s}, \tag{25} $$

and therefore the pattern (24) holds when $p > p_c$. It is seen that
odd correlators are enslaved to the even ones.

To obtain the coefficients, we introduce the generating function
$f(z) = \sum_r f_{2r} z^{2r}$, which satisfies $f(0) = 1$,
$f'(0) = 0$ and $f''(0) = 2f_2 = a^2/(1 - 2a^2)$. The recursion
relation translates into the following equation for $f(z)$

$$ f(\sqrt{2}\,z) = [f(z)]^2. \tag{26} $$

Its solution is $f(z) = \exp\{(az)^2/[2(1 - 2a^2)]\}$. Thus,
$f_{2r} = \frac{1}{r!}[f_2]^r$. From Eqs. (12) and (24), the leading
asymptotic behavior in the regime $p_c < p < 1/2$ is found

$$ G_n(k) \simeq \begin{cases}
\dfrac{(2r)!}{r!}\left[\dfrac{a^2}{2(1 - 2a^2)}\right]^r 2^{-kr} & n = 2r; \\[2ex]
\dfrac{(2r+1)!}{r!}\left[\dfrac{a^2}{2(1 - 2a^2)}\right]^r a^k\,2^{-kr} & n = 2r + 1.
\end{cases} \tag{27} $$
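The claim that the even coefficients take the exponential-series form can be checked exactly with rational arithmetic. The sketch below solves the reduced recursion (25) upward from a given value of f_2 and compares with [f_2]^r / r!; the function names are hypothetical, and f_2 is treated as a free rational input rather than computed from a.

```python
from fractions import Fraction
from math import factorial


def even_coefficients(f2, r_max):
    """Solve f_{2r} 2^r = sum_{s=0}^{r} f_{2s} f_{2r-2s} for r = 0..r_max.

    The s = 0 and s = r terms each contribute f_{2r}, which moves a
    factor 2 to the left hand side. The list index is r, not 2r.
    """
    f = [Fraction(1), Fraction(f2)]
    for r in range(2, r_max + 1):
        inner = sum(f[s] * f[r - s] for s in range(1, r))
        f.append(inner / (2 ** r - 2))
    return f[:r_max + 1]


def closed_form(f2, r):
    """Coefficient of z^(2r) in exp(f2 * z^2)."""
    return Fraction(f2) ** r / factorial(r)
```

Since the arithmetic is exact, the two functions agree identically, which confirms that the exponential solves Eq. (26).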



Using the Stirling formula $n! \sim \sqrt{2\pi n}\,n^n e^{-n}$ it is
seen that the coefficients $g_{2r}$ have nontrivial $r$ behavior as
$g_{2r} = g_{2r+1}/(2r+1) \sim \sqrt{2}\,[2a^2/(1 - 2a^2)]^r r^r e^{-r}$.

The even order correlations have identical asymptotic behavior to the
two point correlation:
$\lim_{k\to\infty}[G_{2r}(k)]^{1/2rk} = 1/\sqrt{2}$ for all $r$. The
odd order correlations behave differently, however, as this limit
depends on the correlation order:
$\lim_{k\to\infty}[G_{2r+1}(k)]^{1/(2r+1)k} = \frac{1}{\sqrt{2}}(\sqrt{2}\,a)^{1/(2r+1)}$.
Thus, only in the limit $r \to \infty$ do the even and odd order
correlations agree. However, this conclusion is misleading since the
decay rate of the (properly normalized) odd order correlations
$G_{2r+1}(k)/G_1(k) \sim G_{2r}(k)$ is identical to that of the even
order correlations. We conclude that the decay rate of two-point
correlations characterizes the decay of all higher order
correlations.

From Eqs. (22) and (27), we see that the coefficients diverge
according to

$$ f_{2r} = f_{2r+1} \sim |p_c - p|^{-r} \tag{28} $$

as the critical point is approached, $p \to p_c$. Since the
correlations must remain finite, this indicates that the purely
exponential behavior must be modified when $p = p_c$. Indeed,
evaluating Eq. (16) at $p = p_c$ yields $F_2(k) \simeq f_2 2^k$ and
$F_3(k) \simeq f_2 2^{3k/2}$ with $f_2 = k/4$, i.e., the even-odd
pattern of Eq. (24) is reproduced. Furthermore, the value of $f_2$
shows that the diverging quantity $1/|1 - 2a^2|$ is simply replaced by
$k$. This implies that the coefficients become generation dependent,
$f_n \to f_n(k)$. Assuming the pattern Eq. (24), substituting it into
Eq. (25), and following the steps that led to Eq. (27) yields the
critical behavior

$$ G_n(k) \simeq \begin{cases}
\dfrac{(2r)!}{r!}\left(\dfrac{k}{4}\right)^r 2^{-kr} & n = 2r; \\[2ex]
\dfrac{(2r+1)!}{r!}\left(\dfrac{k}{4}\right)^r 2^{-k(r+1/2)} & n = 2r + 1.
\end{cases} \tag{29} $$



Generally, the diverging quantity $1/|1 - 2a^2|$ is replaced with the
finite (but ever growing) quantity $k$. The algebraic modification to
the leading exponential behavior in Eq. (29) is reminiscent of the
logarithmic corrections that typically characterize critical behavior
in second order phase transitions [9].
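As a consistency check of the critical behavior for $n = 2$, one can return to the geometric series that led to Eq. (4). The lines below are a worked sketch added for illustration, not part of the original derivation.

```latex
2a^2 = 1:\qquad
G_2(k) = \frac{1}{2^k - 1}\sum_{j=1}^{k} 2^{j-1} a^{2j}
       = \frac{1}{2^k - 1}\sum_{j=1}^{k} \frac{1}{2}
       = \frac{k}{2\,(2^k - 1)}
       \simeq \frac{k}{2}\,2^{-k}.
```

This matches the value $f_2 = k/4$ quoted above, since $G_2(k) = F_2(k)/\binom{2^k}{2} \simeq 2 f_2\,2^{-k}$ for large $k$.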



IV. STOCHASTIC TREE MORPHOLOGIES 

The question arises: how general is the behavior de- 
scribed above? The binary tree considered was particu- 
larly simple as it involved a fixed number of children and 
a fixed generation lifetime. Below we show that relaxing 
either of these conditions does not affect the nature of 
the results. 

Let us first consider tree morphologies with a varying number of
children, i.e., the trees are generated by a stochastic branching
process where with probability $P_r$ there are $r$ children. This
probability sums to unity, $\sum_r P_r = 1$, and the average number
of children is given by $\langle r\rangle = \sum_r r P_r$. As a
result, the average number of nodes at the $k$th generation is
$\langle r\rangle^k$, indicating that the tree "survives" only if
$\langle r\rangle > 1$, a classical result of branching processes
theory [10]. The rule (3) is independent of the tree morphology, and
therefore, one can repeat the heuristic argument in Sec. II. The
extreme contributions to the average pair correlations have the
relative weights $\langle r\rangle^{k-1}a^2$ and
$\langle r\rangle^{2(k-1)}a^{2k}$. Comparing these two terms
asymptotically shows that the critical point is a simple
generalization of Eq. (6)

$$ p_c = \frac{1}{2}\left(1 - \frac{1}{\sqrt{\langle r\rangle}}\right). \tag{30} $$



The critical mutation rate varies from $0$ to $1/2$ as the average
ancestry size varies between $1$ and $\infty$. This indicates that
correlations are significant over a larger range of mutation rates
for smaller trees. The heuristic argument also gives the decay
constant $\beta$, and the leading asymptotic behavior of Eq. (5) is
generalized by simply replacing $2$ with $\langle r\rangle$. A more
complete treatment of this problem is actually possible and closely
follows Eq. (4). Again, the ancestry size $\langle r\rangle$ replaces
the deterministic value $2$. As both the results and the overall
behavior closely follow the deterministic case, we do not detail
them here.
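For a concrete offspring distribution, the generalized critical point follows directly from the mean number of children. The sketch below represents the distribution as a dictionary mapping r to P_r; this representation, the validation checks, and the function names are illustrative assumptions.

```python
import math


def mean_children(offspring):
    """Average number of children <r> = sum_r r * P_r for a dict {r: P_r}."""
    if abs(sum(offspring.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    return sum(r * prob for r, prob in offspring.items())


def critical_mutation_probability(offspring):
    """Critical point p_c = (1 - <r>**-0.5) / 2, the condition <r> a^2 = 1."""
    r_mean = mean_children(offspring)
    if r_mean < 1:
        # a tree with <r> < 1 dies out, so no critical point is defined
        raise ValueError("average number of children must be at least one")
    return 0.5 * (1 - 1 / math.sqrt(r_mean))
```

For instance, `critical_mutation_probability({2: 1.0})` recovers the binary-tree value of Eq. (6), and any distribution with the same mean gives the same critical point within this heuristic.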

A second possible generalization is to morphologies with a varying
generation lifetime. Such tree morphologies can be realized by
considering a continuous time variable. Branching is assumed to occur
with a constant rate $\nu$. For such tree morphologies, the number of
nodes $n(t)$ obeys $dn/dt = \nu n(t)$, which gives an exponential
growth $n(t) = e^{\nu t}$. Similarly, the mutation process is assumed
to occur with a constant rate $\gamma$. A useful characteristic of
this process is the autocorrelation
$A(t) = \langle\sigma(0)\sigma(t)\rangle$. To evaluate its evolution,
we note that $A(t + dt) = (1 - \gamma\,dt)A(t) - \gamma\,dt\,A(t)$
when $dt \to 0$. Therefore, $dA/dt = -2\gamma A(t)$ and one finds
$A(t) = e^{-2\gamma t}$. The quantities $n(t)$ and $A(t)$ allow
calculation of the average pair correlation.

Let us pick two nodes at time $t$ and denote their values by
$\sigma_i(t)$ and $\sigma_j(t)$, and let the genetic distance between
these two nodes be $\tau$. Using their first common ancestor
$\sigma_c(t - \tau) = \sigma_i(t - \tau) = \sigma_j(t - \tau)$ and the
identity $\sigma^2 = 1$, their correlation can be evaluated as follows:
$\langle\sigma_i(t)\sigma_j(t)\rangle = \langle\sigma_i(t)\sigma_c(t-\tau)\sigma_c(t-\tau)\sigma_j(t)\rangle = \langle\sigma_i(t)\sigma_i(t-\tau)\rangle\langle\sigma_j(t)\sigma_j(t-\tau)\rangle = A^2(\tau)$.
Integrating over all possible genetic distances gives the average
pair correlation

$$ G_2(t) = \frac{\int_0^t d\tau\, n(\tau) A^2(\tau)}{\int_0^t d\tau\, n(\tau)}. \tag{31} $$

The factor $n(\tau)/\int_0^t d\tau\, n(\tau)$ accounts for the
multiplicity of pairs with genetic distance $\tau$. Using
$A^2(\tau) = e^{-4\gamma\tau}$ and $n(\tau) = e^{\nu\tau}$, the
average pair correlation is evaluated

$$ G_2(t) = \frac{\nu}{\nu - 4\gamma}\,\frac{e^{(\nu - 4\gamma)t} - 1}{e^{\nu t} - 1}, \tag{32} $$



and therefore $G_2^*(t) = e^{-4\gamma t}$. Here the relevant parameter
is the normalized mutation rate $\omega = \gamma/\nu$. Again, there
exists a critical point $\omega_c = 1/4$. For smaller than critical
mutation rates, $\omega < \omega_c$, correlations due to the tree
morphology are not pronounced, $G_2(t) \propto G_2^*(t)$. On the
other hand, when $\omega > \omega_c$, strong correlations are
generated and $G_2(t)$ is exponentially larger than $G_2^*(t)$. We
conclude that the behavior found for the deterministic case is
robust.
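The continuous-time result can be checked by evaluating the integrals of Eq. (31) numerically and comparing with the closed form of Eq. (32). The sketch below uses a plain midpoint rule to stay dependency-free; the function names, the step count, and the separate handling of the critical rate are illustrative choices.

```python
import math


def g2_continuous(t, nu, gamma):
    """Closed form of Eq. (32) for branching rate nu and mutation rate gamma."""
    rate = nu - 4 * gamma
    if abs(rate) < 1e-12:
        # at the critical rate the numerator integral is simply t
        return nu * t / math.expm1(nu * t)
    return nu / rate * math.expm1(rate * t) / math.expm1(nu * t)


def g2_by_quadrature(t, nu, gamma, steps=20000):
    """Eq. (31) evaluated with the midpoint rule on `steps` subintervals."""
    h = t / steps
    numerator = denominator = 0.0
    for m in range(steps):
        tau = (m + 0.5) * h
        n_tau = math.exp(nu * tau)  # multiplicity n(tau) = exp(nu tau)
        numerator += n_tau * math.exp(-4 * gamma * tau)  # times A^2(tau)
        denominator += n_tau
    return numerator / denominator
```

Comparing the two functions on either side of gamma/nu = 1/4 shows the change in the leading decay from exp(-4 gamma t) to exp(-nu t).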



V. MULTISTATE SEQUENCES 

We now consider larger alphabets. Previously, the two states
satisfied $\sigma^2 = 1$. A natural generalization is to
$\sigma^n = 1$, i.e., the $n$th order roots of unity
$\sigma = e^{i2\pi l/n}$ with $l = 0, 1, \ldots, n-1$. Previously,
with probability $p$ the mutation $\sigma \to \tau\sigma$ occurred
with $\tau = e^{i\theta}$ and $\theta = \pi$. We thus impose the same
transition but with $\theta = 2\pi/n$. This can be viewed as a
rotation in the complex plane by an angle $\theta$. Since the states
are now complex, the definition of the pair correlation is now

$$ G_2(k) = \langle\langle \bar\sigma_i \sigma_j \rangle\rangle, \tag{33} $$

with $\bar\sigma$ the complex conjugate of $\sigma$. The real part of
$\bar\sigma_i\sigma_j$ gives the inner product of the two-dimensional
vectors corresponding to $\sigma_i$ and $\sigma_j$, respectively.

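Consider the average $\langle\bar\sigma_3\sigma_4\rangle$ in Fig. 1.
Using the $\tau$ variables and $\bar\tau\tau = \bar\sigma\sigma = 1$
one has
$\langle\bar\sigma_3\sigma_4\rangle = \langle\bar\sigma_0\bar\tau_1\bar\tau_3\,\sigma_0\tau_1\tau_4\rangle = \langle\bar\tau_3\tau_4\rangle = \langle\bar\tau_3\rangle\langle\tau_4\rangle = \overline{\langle\tau\rangle}\,\langle\tau\rangle = |\langle\tau\rangle|^2$.
All of our previous results hold if one replaces the average
$\langle\tau\rangle$ with its magnitude
$a = |\langle\tau\rangle| = |1 - p(1 - e^{i\theta})| = \sqrt{1 - 2p(1-p)(1 - \cos\theta)}$.
Furthermore, it is sensible to consider arbitrary phase shifts
$0 < \theta < 2\pi$ since the identity
$\bar\sigma\sigma = \bar\tau\tau = 1$ rather than
$\sigma^n = \tau^n = 1$ was used to evaluate correlations.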
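The expression for the magnitude a can be confirmed with complex arithmetic. The sketch below computes the average branch variable directly and compares its modulus with the square-root formula; the function names are hypothetical, and the clamp at zero only guards against rounding at p = 1/2, theta = pi.

```python
import cmath
import math


def mean_tau(p, theta):
    """Average branch variable: 1 with probability 1 - p, exp(i theta) with p."""
    return (1 - p) + p * cmath.exp(1j * theta)


def a_magnitude(p, theta):
    """Magnitude a = sqrt(1 - 2 p (1 - p) (1 - cos theta))."""
    value = 1 - 2 * p * (1 - p) * (1 - math.cos(theta))
    return math.sqrt(max(0.0, value))
```

For theta = pi the magnitude reduces to |1 - 2p|, the two-state value used in the earlier sections.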

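The critical point is determined from the condition
$\langle r\rangle a^2 = 1$. This equation has a physical solution only
when $2\varphi < \theta < 2(\pi - \varphi)$ with the shorthand
notation

$$ \varphi = \arcsin\sqrt{1 - \frac{1}{\langle r\rangle}}. \tag{34} $$

In terms of the number of states, this translates to

$$ \frac{\pi}{\pi - \varphi} < n < \frac{\pi}{\varphi}. \tag{35} $$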



Hence, the transition may or may not exist depending on the details
of the model, in this particular "clock" model case, the number of
states. As we have seen before, correlations become less pronounced
when the number of ancestors increases. Indeed, the transition always
exists in the limit $\langle r\rangle \to 1$, while the transition is
eliminated in the other extreme $\langle r\rangle \to \infty$. When
the transition does occur, the following critical mutation
probability is found

$$ p_c = \frac{1}{2}\left[1 - \sqrt{1 - \frac{\sin^2\varphi}{\sin^2(\theta/2)}}\right]. \tag{36} $$



Indeed, Eq. (30) is reproduced in the two state case
($\theta = \pi$). This turns out to be the minimal critical point,
$p_c \geq (1 - \sqrt{1/\langle r\rangle})/2$, reflecting the fact that
the transition $\sigma \to -\sigma$ provides the most effective
mutation mechanism.
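The existence condition and the critical probability can be explored numerically. The sketch below solves the condition that the average number of children times a^2 equals one, with a^2 = 1 - 2p(1-p)(1 - cos theta), for the smaller root p; writing the solvability threshold as sin^2(phi) = 1 - 1/<r> is an assumption derived here from that condition, and the function name `critical_p` is hypothetical.

```python
import math


def critical_p(theta, r_mean):
    """Critical mutation probability, or None when no transition exists.

    Solves r_mean * a^2 = 1, which amounts to
    p (1 - p) = sin^2(phi) / (4 sin^2(theta / 2)).
    """
    if r_mean <= 1:
        raise ValueError("r_mean must exceed one")
    sin2_phi = 1 - 1 / r_mean
    sin2_half = math.sin(theta / 2) ** 2
    if sin2_half < sin2_phi:
        return None  # p (1 - p) would have to exceed 1/4
    return 0.5 * (1 - math.sqrt(1 - sin2_phi / sin2_half))
```

On a binary tree, `critical_p(math.pi, 2.0)` reproduces the two-state value, while a small rotation angle such as `critical_p(math.pi / 4, 2.0)` returns `None`, illustrating how the transition disappears when mutations are too weak.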

Interestingly, the transition is restored when both the mutation and
the duplication processes occur continuously in time. In this
continuous description duplication occurs with rate $\nu$ and the
mutation $\sigma \to e^{i\theta}\sigma$ occurs with rate $\gamma$. The
autocorrelation
$A(t) = \langle\sigma(0)\bar\sigma(t)\rangle = \exp[-\gamma(1 - e^{-i\theta})t]$
is found from its time evolution
$dA/dt = -\gamma(1 - e^{-i\theta})A(t)$. It can be easily shown from
the definition of the pair correlation (33) that $A^2(t)$ should be
replaced with $|A(t)|^2 = \exp[-2\gamma(1 - \cos\theta)t]$ in the
integral (31).



Comparing with the previous section results, we see that the
effective mutation rate is now $\gamma(1 - \cos\theta)/2$. As a
result, the location of the critical point is increased by a factor
$2/(1 - \cos\theta)$. Using the normalized mutation rate
$\omega = \gamma/\nu$ one finds

$$ \omega_c = \frac{1}{2(1 - \cos\theta)}. \tag{37} $$



This critical point increases with the number of states, and it
diverges according to $\omega_c \sim (n/2\pi)^2$ when $n \to \infty$.
This behavior is intuitive as one expects that mutations between a
large number of states diminish correlations and consequently,
phylogenetic effects.



VI. TWO-SITE CORRELATIONS

When sequences are not of unit length, i.e., when there are two or
more sites per sequence, the results can be used to characterize a
correlation measure quantifying the interaction between sites. Assume
there are two or more sites per sequence and that the sites evolve
independently of each other. Denote the state of position $a$ in
sequence $i$ as $\sigma_i^a$, and similarly denote the state of
position $b$ in sequence $i$ as $\sigma_i^b$. If the sequences were
not related by a phylogenetic tree, but instead were independent
samples drawn from a given distribution, then the following quantity
defined on a finite set of $N = 2^k$ samples specifies a two-site
correlation measure:

$$ \rho = \frac{1}{N}\sum_i \sigma_i^a\sigma_i^b - \frac{1}{N^2}\sum_i \sigma_i^a \sum_j \sigma_j^b. \tag{38} $$

Correlation between sites $a$ and $b$ is indicated by a nonzero value
of $\rho$.

The quantity $\rho$ is well defined also when the sequences are
related by a phylogenetic tree. Due to the assumption of independent
positions, the mean of $\rho$ over all realizations vanishes,
$\langle\rho\rangle = 0$. This behavior is independent of the tree
morphology. To see the effects of the phylogeny, one needs to
consider fluctuations, i.e., the variance
$\Delta\rho = \langle\rho^2\rangle$,

$$ \langle\rho^2\rangle = \langle\rho_1^2\rangle - \langle\rho_2^2\rangle = \frac{1}{N^2}\sum_{ij}\langle\tau\rangle^{2d_{ij}} - \frac{1}{N^4}\sum_{ij}\sum_{kl}\langle\tau\rangle^{d_{ij} + d_{kl}}. $$

The first equality in the above equation was obtained by rewriting
Eq. (38) as $\rho = \rho_1 - \rho_2$ and noting that
$\langle\rho_1\rho_2\rangle = \langle\rho_2^2\rangle$. The final
expression can be simplified using
$\sum_{i \neq j}\langle\tau\rangle^{2d_{ij}} = N(N-1)G_2(a^2, k)$ with
$G_2(a^2, k)$ the pair correlation of Eq. (4), considered as a
function of $a^2$. The following expression for the variance is
obtained

$$ \Delta\rho = \left[\frac{1}{N} + \left(1 - \frac{1}{N}\right)G_2(a^2, k)\right] - \left[\frac{1}{N} + \left(1 - \frac{1}{N}\right)G_2(a, k)\right]^2. \tag{39} $$

For the star morphology the leading order of the fluctuations is
independent of the mutation rate and it scales as the familiar
$N^{-1}$. For the binary tree morphology, there are again two
regimes, characterized by $p > p_c$ or $p < p_c$ where $p_c$ is now
defined by $2a^4 = 1$, i.e.,

$$ p_c = \frac{1}{2}\left(1 - 2^{-1/4}\right). \tag{40} $$

When $p < p_c$, the phylogeny plays a significant role and the
variance is exponentially enhanced, $\Delta\rho \sim a^{4k}$, while
when $p > p_c$, the variance is still statistical in nature,
$\Delta\rho \simeq A\,\Delta\rho^*$ with $A > 1$. Hence, it is more
likely to observe large values of $\rho$ in the tree morphology than
it is in the star morphology, even when the sites evolve
independently. Since correlations and variance play opposite roles,
they are influenced in different ways by the phylogeny.
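The variance of the two-site measure can be evaluated for both morphologies from the pair correlation alone. The sketch below sums the geometric series for the binary tree directly and uses the constant distance 2k for the star; it assumes k >= 1, and the function names are hypothetical.

```python
def g2_series(k, a):
    """Binary-tree pair correlation from the geometric series behind Eq. (4)."""
    return sum(2 ** (j - 1) * a ** (2 * j) for j in range(1, k + 1)) / (2 ** k - 1)


def variance_tree(k, a):
    """Variance of the two-site measure on the binary tree with N = 2^k nodes."""
    n = 2 ** k
    first = 1 / n + (1 - 1 / n) * g2_series(k, a * a)  # a replaced by a^2
    second = 1 / n + (1 - 1 / n) * g2_series(k, a)
    return first - second ** 2


def variance_star(k, a):
    """Same quantity on the star morphology, where every pair has d = 2k."""
    n = 2 ** k
    first = 1 / n + (1 - 1 / n) * a ** (4 * k)
    second = 1 / n + (1 - 1 / n) * a ** (2 * k)
    return first - second ** 2
```

As an example, for k = 4 and a = 0.9 the tree value exceeds the star value, in line with the statement that large values of the measure are more likely on the tree.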



VII. SUMMARY 

In summary, we have studied the influence of the phy- 
logeny on correlations between the tree's nodes. In gen- 
eral, for sufficiently small mutation rates, the morphology 
plays a minor role. For sufficiently high mutation rates 
large correlations that can be attributed to the phylogeny 
may occur. The transition between the two regimes of 
behavior is sharp and is marked by a critical mutation 
rate. Below this critical point all correlations are well de- 
scribed by the average, while above it, correlations decay 
much more slowly than the average. Underlying this transition
is the competition between the multiplicity and the
degree of correlations between genetically close and
distant leaves. This competition also leads to larger fluctuations
in the correlation between different sites, even
when these evolve independently.

We have also seen that this behavior is robust and ap- 
pears to be independent of many details of the model. 
While the overall behavior generally holds, specific de- 
tails such as the location of the critical point and the 
decay rate in the regime p > p c depend on a specific tree 
dependent parameter: the average number of children. 

The above results can be extended in several direc- 
tions. It will be interesting to see whether the recursive 
methods can be generalized to stochastic tree morpholo- 
gies and in particular to the continuous time case. These
methods should still be applicable even when the mu- 
tation rates are time dependent or disordered. In such 
cases it will be interesting to determine which parameters 
determine the critical point, the decay constants, etc. 

Correlations can serve as a useful measure of the diver-
sity of a system since small correlations indicate large di- 
versity and vice versa. If the diversity can be measured in 
an experiment where the phylogeny is controlled, its time 
dependence can be used to infer the mutation probability. 
Similarly, if the mutation probability can be controlled, 
then the degree of correlation/diversity can be used to
infer characteristics of the phylogeny. Thus, our results 
may be useful for inferring statistical properties of actual 
biological systems. 

This research is supported by the Department of En- 
ergy under contract W-7405-ENG-36. 